GeoWeb Crawler: An Extensible and Scalable Web Crawling Framework for Discovering Geospatial Web Resources
نویسندگان
چکیده
With the advance of the World-Wide Web (WWW) technology, people can easily share content on the Web, including geospatial data and web services. Thus, the “big geospatial data management” issues start attracting attention. Among the big geospatial data issues, this research focuses on discovering distributed geospatial resources. As resources are scattered on the WWW, users cannot find resources of their interests efficiently. While the WWW has Web search engines addressing web resource discovery issues, we envision that the geospatial Web (i.e., GeoWeb) also requires GeoWeb search engines. To realize a GeoWeb search engine, one of the first steps is to proactively discover GeoWeb resources on the WWW. Hence, in this study, we propose the GeoWeb Crawler, an extensible Web crawling framework that can find various types of GeoWeb resources, such as Open Geospatial Consortium (OGC) web services, Keyhole Markup Language (KML) and Environmental Systems Research Institute, Inc (ESRI) Shapefiles. In addition, we apply the distributed computing concept to promote the performance of the GeoWeb Crawler. The result shows that for 10 targeted resources types, the GeoWeb Crawler discovered 7351 geospatial services and 194,003 datasets. As a result, the proposed GeoWeb Crawler framework is proven to be extensible and scalable to provide a comprehensive index of GeoWeb.
منابع مشابه
Automatic Generation of Geospatial Metadata for Web Resources
Web resources that are not part of any Spatial Data Infrastructure can be an important source of information. However, the incorporation of Web resources within a Spatial Data Infrastructure requires a significant effort to create metadata. This work presents an extensible architecture for an automatic characterisation of Web resources and a strategy for assignation of their geographic scope. T...
متن کاملPrioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler it is not an simple task to download the domain specific web pages. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed, among them a key technique is focused crawling which is able to crawl particular topical...
متن کاملTarantula - A Scalable and Extensible Web Spider
Web crawlers today suffer from poor navigation techniques which reduce their scalability while crawling the World Wide Web (WWW). In this paper we present a web crawler named Tarantula that is scalable and fully configurable. The work on Tarantula project was started with the aim of making a simple, elegant and yet an efficient Web Crawler offering better crawling strategies while walking throu...
متن کاملFast and Scalable Pattern Mining for Media-Type Focused Crawling
Search engines targeting content other than hypertext documents require a crawler that discovers resources identifying files of certain media types. Naı̈ve crawling approaches do not guarantee a sufficient supply of new URIs (Uniform Resource Identifiers) to visit; effective and scalable mechanisms for discovering and crawling targeted resources are needed. One promising approach is to use data ...
متن کاملDiscovering Land Cover Web Map Services from the Deep Web with JavaScript Invocation Rules
Automatic discovery of isolated land cover web map services (LCWMSs) can potentially help in sharing land cover data. Currently, various search engine-based and crawler-based approaches have been developed for finding services dispersed throughout the surface web. In fact, with the prevalence of geospatial web applications, a considerable number of LCWMSs are hidden in JavaScript code, which be...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- ISPRS Int. J. Geo-Information
دوره 5 شماره
صفحات -
تاریخ انتشار 2016